10 research outputs found

A Nearest-Neighbor Approach to Indicative Web Summarization

Author: Yves Petinot
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2016

Through their role of content proxy, in particular on search engine result pages, Web summaries play an essential part in the discovery of information and services on the Web. In their simplest form, Web summaries are snippets based on a user query and are obtained by extraction from the content of Web pages. The focus of this work, however, is on indicative Web summarization, that is, on the generation of summaries describing the purpose, topics and functionalities of Web pages. In many scenarios (e.g., navigational queries or content-deprived pages), such summaries represent a valuable commodity to concisely describe Web pages while circumventing the need to produce snippets from inherently noisy, dynamic, and structurally complex content. Previous approaches have identified linking pages as a privileged source of indicative content from which Web summaries may be derived using traditional extractive methods. To be reliable, these approaches require sufficient anchortext redundancy, ultimately showing the limits of extractive algorithms for what is, fundamentally, an abstractive task. In contrast, we explore the viability of abstractive approaches and propose a nearest-neighbors summarization framework leveraging summaries of conceptually related (neighboring) Web pages. We examine the steps that can lead to the reuse and adaptation of existing summaries to previously unseen pages. Specifically, we evaluate two Text-to-Text transformations that cover the main types of operations applicable to neighbor summaries: (1) ranking, to identify neighbor summaries that best fit the target; (2) target adaptation, to adjust individual neighbor summaries to the target page based on neighborhood-specific template-slot models. For this last transformation, we report on an initial exploration of the use of slot-driven compression to adjust adapted summaries based on the confidence associated with token-level adaptation operations.
Overall, this dissertation explores a new research avenue for indicative Web summarization and shows the potential value, given the diversity and complexity of the content of Web pages, of transferring, and, when necessary, of adapting, existing summary information between conceptually similar Web pages.

Columbia University Academic Commons
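
The ranking transformation described in the abstract above (choosing the neighbor summaries that best fit a target page) can be illustrated with a minimal sketch. The scoring function here is an assumption: the abstract does not say how fit is measured, so this example uses plain bag-of-words cosine similarity, and the names `tokens`, `cosine` and `rank_neighbor_summaries` are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_neighbor_summaries(target_text, neighbor_summaries):
    """Order neighbor summaries from best to worst fit for the target page.

    Assumption: fit is approximated by lexical overlap with the target text.
    """
    target = Counter(tokens(target_text))
    scored = [(cosine(target, Counter(tokens(s))), s) for s in neighbor_summaries]
    # sorted() is stable, so summaries with equal scores keep their input order.
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    target_page = "Online store selling hiking boots and camping gear"
    neighbors = [
        "A recipe blog with weekly dinner ideas",
        "Online store for camping gear and hiking equipment",
        "News portal covering local politics",
    ]
    for summary in rank_neighbor_summaries(target_page, neighbors):
        print(summary)
```

A lexical score is only a stand-in; any function that scores a summary against the target page could replace `cosine` without changing the ranking loop.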

A Hierarchical Model of Web Summaries

Author: Kapil Thadani
Kathleen Mckeown
Yves Petinot
Publication date: 01/01/2011

We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.

CiteSeerX

Columbia University Academic Commons
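
The evaluation mentioned in the abstract above compares models by perplexity on held-out data against a flat 1-gram baseline. As a rough illustration of how such a baseline could be scored, the sketch below trains a unigram model and computes its held-out perplexity. The add-one smoothing and the function names are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def train_unigram(train_tokens, vocab_size):
    """Return a probability function for an add-one smoothed unigram model.

    vocab_size must count every token type that can occur, including types
    unseen in training, so that the probabilities sum to one.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob


def perplexity(prob, heldout_tokens):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_sum = sum(math.log(prob(t)) for t in heldout_tokens)
    return math.exp(-log_sum / len(heldout_tokens))


if __name__ == "__main__":
    train = "web directory of science pages and science news".split()
    heldout = "science pages".split()
    vocab = len(set(train) | set(heldout))
    model = train_unigram(train, vocab)
    print(perplexity(model, heldout))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is the sense in which the abstract reports improvements over the baseline.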

eBizSearch: an OAI-Compliant Digital Library for eBusiness

Author: Arvind Rangaswamy
C. Lee Giles
Hui Han
Nirmal Pal
Pradeep B. Teregowda
Steve Lawrence
Yves Petinot

CiteSeerX

Enabling Interoperability For Autonomous Digital Libraries : An API To CiteSeer Services

Author: C. Lee Giles
Hui Han
Pradeep B. Teregowda
Vivek Bhatnagar
Yves Petinot

We introduce CiteSeer-API, a public API to CiteSeer-like services. CiteSeer-API is SOAP/WSDL based and allows for easy programmatic access to all the specific functionalities offered by CiteSeer services, including full-text search of documents and citations and citation-based document discovery. CiteSeer-API is currently showcased on SMEALSearch [10], a digital library search engine for business academic publications.

CiteSeerX

A Service-Oriented Architecture for Digital Libraries

Author: C. Lee Giles
Hui Han
Isaac G. Councill
Pradeep B. Teregowda
V. Bhatnagar
Vivek Bhatnagar
Yves Petinot
Publication date: 01/01/2004

CiteSeer is currently a very large source of meta-data information on the World Wide Web (WWW). This meta-data is the key material for the Semantic Web. Still, CiteSeer is not yet a Semantic-enabled service and therefore its meta-data, although potentially usable by Semantic Web agents, is not yet reachable using the Semantic Web mechanisms. The complexity of CiteSeer, that is, the range of tasks it supports, makes the transition to a Semantic-enabled service a non-trivial task. While human users tend to perceive CiteSeer as a single well-integrated service, we believe it is best seen, from a machine perspective, as a collection of services, each service performing a specific task. In this paper we show our approach to enable CiteSeer on the Semantic Web in order to allow the use of its meta-data through the Semantic Web. We first introduce an intuitive Application Programming Interface (API) to the CiteSeer software, then show that an efficient integration of CiteSeer in the Semantic Web can be best achieved by independently integrating the services that comprise it. We believe the effort presented here towards the Semantic-integration of a complex Information Retrieval system could be used as an integration model for arbitrary systems.

CiteSeerX

Crossref

Citeseer-api: towards seamless resource location and interlinking for digital libraries

Author: C. Lee Giles
Hui Han
Isaac G. Councill
Pradeep B. Teregowda
V. Bhatnagar
Vivek Bhatnagar
Yves Petinot
Publication venue: ACM Press
Publication date: 01/01/2004

We introduce CiteSeer-API, a public API to CiteSeer-like services. CiteSeer-API is SOAP/WSDL based and allows for easy programmatic access to all the specific functionalities offered by CiteSeer services, including full-text search of documents and citations and citation-based document discovery. In order to enable operability and interlinking with arbitrary software agents and digital library systems, CiteSeer-API uses digital content signatures to create system-independent handles for the Document, Citation and Group resources of CiteSeer servers. We discuss specific functionalities of CiteSeer-API that take advantage of these handles in order to enable seamless location of CiteSeer resources. Finally, we argue that the digital signature scheme used by CiteSeer-API is well suited for the creation of machine-usable semantic descriptions of digital library services, which is the key toward seamless discovery and integration of services such as CiteSeer-API. CiteSeer-API is currently showcased on CiteSeer.IST, the CiteSeer server of the School o

CiteSeerX
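
The abstract above describes handles derived from digital content signatures, so that the same resource gets the same identifier on any server. A minimal sketch of that idea follows. The hash function (SHA-256), the whitespace normalization and the `type:digest` handle format are all assumptions chosen for illustration; the abstract does not specify the actual CiteSeer-API scheme.

```python
import hashlib


def normalize(text):
    """Collapse whitespace and lowercase, so trivial formatting differences
    do not change the signature (an assumption of this sketch)."""
    return " ".join(text.lower().split())


def content_handle(resource_type, text):
    """Build a system-independent handle from a resource's content.

    The handle depends only on the resource type and its normalized content,
    never on a server-local database id.
    """
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return f"{resource_type}:{digest}"


if __name__ == "__main__":
    # Two servers holding the same document text derive the same handle.
    on_server_a = content_handle("document", "A Service-Oriented  Architecture")
    on_server_b = content_handle("document", "a service-oriented architecture")
    print(on_server_a == on_server_b)
```

Because the handle is computed from content alone, two independent systems can agree on an identifier without exchanging any local keys.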

eBizSearch: A Niche Search Engine for e-Business

Author: Arvind Rangaswamy
C. Lee Giles
Hui Han
Nirmal Pal
Pradeep B. Teregowda
Steve Lawrence
Yves Petinot
Publication venue: ACM

Niche search engines offer an efficient alternative when the results returned by general-purpose search engines do not provide a sufficient degree of relevance. By taking advantage of their domain of concentration, they achieve higher relevance and offer enhanced features. We discuss a new niche search engine, eBizSearch, based on the technology of CiteSeer and dedicated to e-business and e-business documents. We present the integration of CiteSeer in the framework of eBizSearch and the process necessary to tune the whole system towards the specific area of e-business. We also discuss how we use machine learning algorithms to generate the metadata that makes eBizSearch Open Archives compliant. eBizSearch is a publicly available service and can be reached at [3].

CiteSeerX